Introduction¶

The dataset includes information about schools in the NYC district, focusing on variables like total enrollment, percentages of students with disabilities, English learners, and those living in poverty. By analyzing these metrics, I aim to uncover relationships and pose questions that could drive future data-driven strategies for improving equity and resource allocation across NYC schools.

Table of Contents¶

  1. Introduction
  2. Dataset Overview
  3. Data Cleaning and Preprocessing
  4. Exploratory Data Analysis
    • Distribution Analysis
    • Correlation Analysis
  5. Insights and Recommendations
  6. Conclusion

Dataset Overview¶

The dataset consists of 5 key columns:

  • school_name: The name of each school.
  • total_enrollment: The total number of students enrolled.
  • percent_students_with_disabilities: Percentage of enrolled students with disabilities.
  • percent_english_learners: Percentage of students who are English language learners.
  • percent_poverty: Percentage of students living in poverty.

Additionally, a derived column, poverty_level, was created to categorize schools into "Low" or "High" poverty based on their poverty percentage.
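
A minimal sketch of how such a derived column could be built. The cutoff separating "Low" from "High" is not stated in this notebook, so `POVERTY_THRESHOLD` below is a placeholder assumption, and the three rows are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical rows for illustration only; the notebook itself would use `df`.
schools = pd.DataFrame({
    "school_name": ["School A", "School B", "School C"],
    "percent_poverty": [42.0, 75.0, 91.5],
})

# ASSUMPTION: the actual cutoff is not given in the text; 75.0 is a placeholder.
POVERTY_THRESHOLD = 75.0

# Schools at or above the threshold are labeled "High", the rest "Low".
schools["poverty_level"] = (
    schools["percent_poverty"]
    .ge(POVERTY_THRESHOLD)
    .map({True: "High", False: "Low"})
)

print(schools)
```

Whatever threshold is chosen, it is worth stating it next to the column definition so readers can interpret the two groups.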

Data Cleaning and Preprocessing¶

Describe any steps taken to clean and preprocess the data:

  • Did you handle missing values (and if so, how)?
  • Did you format or transform any variables?
  • Did you remove duplicates or outliers?
  • Did you create new columns (e.g., derived metrics)?

Document the reasoning for each cleaning step.
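
The questions above can be answered with a few quick checks before any cleaning decision is made. The sketch below is one possible way to summarize them; `cleaning_report` is a hypothetical helper name and the sample frame is made up for illustration.

```python
import pandas as pd


def cleaning_report(frame: pd.DataFrame) -> dict:
    """Summarize row count, missing values per column, and duplicate rows."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Hypothetical sample with one fully duplicated row and one missing value.
sample = pd.DataFrame({
    "school_name": ["School A", "School A", "School B"],
    "total_enrollment": [100.0, 100.0, None],
})

report = cleaning_report(sample)
print(report)
```

Running the same helper on `df` after loading the CSV would show whether any cleaning is needed at all, which is useful to document even when the answer is "none".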

Exploratory Data Analysis¶

In [1]:
# Import libraries
import pandas as pd
import plotly.express as px

# Uncomment the next two lines so the plots render when the notebook is viewed on GitHub:
# import plotly.io as pio
# pio.renderers.default = "notebook"
In [2]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv("datasets/schools.csv")
df.head()
Out[2]:
   school_name                                        total_enrollment  percent_students_with_disabilities  percent_english_learners  percent_poverty
0  47 The American Sign Language and English Seco...             176.6                               27.56                      6.12            86.08
1  A. Philip Randolph Campus High School                        1394.8                               13.20                      9.94            83.70
2  A.C.E. Academy for Scholars at the Geraldine F...             469.8                               10.30                      8.18            65.50
3  ACORN Community High School                                   391.0                               27.08                      4.16            83.88
4  Abraham Lincoln High School                                  2161.8                               15.48                     15.88            72.00
In [3]:
# Generate summary statistics for the numeric columns to understand the dataset's structure
df.describe()
Out[3]:
       total_enrollment  percent_students_with_disabilities  percent_english_learners  percent_poverty
count       1834.000000                         1834.000000               1834.000000      1834.000000
mean         588.309481                           21.944242                 13.284084        75.030873
std          481.135549                           15.793045                 13.986404        19.179663
min           12.000000                            0.000000                  0.000000         3.950000
25%          313.250000                           14.963750                  4.120000        68.065000
50%          476.700000                           19.340000                  9.070000        79.680000
75%          694.950000                           24.415000                 17.785000        88.580000
max         5591.800000                          100.000000                 99.600000        99.380000

For the three percentage columns, the mean and median are within about five percentage points of each other, which suggests the bulk of each distribution is not heavily skewed, although maxima at or near 100% show that a few extreme values exist. Total enrollment is different. Its mean (about 588) is well above its median (about 477), and its standard deviation (about 481) is nearly as large as the mean. This suggests that a small number of very large schools are pulling the average up. The minimum (12) and maximum (about 5,592) confirm a wide range of enrollment sizes, with a few schools likely being outliers.
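
The mean-versus-median comparison can also be made explicit in code. The sketch below is an illustration, not part of the original analysis: `mean_median_gap` is a hypothetical helper, and the toy series is invented to mimic a right-skewed column. For reference, the enrollment figures reported by `df.describe()` give a gap of roughly 23% of the median.

```python
import pandas as pd


def mean_median_gap(series: pd.Series) -> dict:
    """Compare mean and median and report sample skewness for one column."""
    mean = series.mean()
    median = series.median()
    return {
        "mean": mean,
        "median": median,
        # How far the mean sits above (+) or below (-) the median, in percent.
        "gap_pct_of_median": (mean - median) / median * 100,
        # Positive skewness indicates a longer right tail.
        "skew": series.skew(),
    }


# Hypothetical right-skewed values: one large value drags the mean upward.
toy = pd.Series([10, 20, 30, 40, 400])
summary = mean_median_gap(toy)
print(summary)
```

Applied to each numeric column of `df`, a large positive gap together with positive skewness would flag the columns where the median is the safer summary.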

Distribution Analysis¶

To better understand the distribution above, creating a histogram of total student enrollment in NYC public high schools will allow us to visualize the spread and confirm whether the data is skewed or clustered in specific ranges.

In [4]:
# Plot the data showing the total student enrollment
# Simple
# px.histogram(df, x="total_enrollment")
In [5]:
# Plot the data showing the total student enrollment
# With customizations
fig = px.histogram(
    df, 
    x="total_enrollment", 
    nbins=30,  # Upper bound on the number of bins (plotly picks the exact bin width)
    title="Distribution of Total Enrollment in NYC Public High Schools",
    labels={"total_enrollment": "Total Enrollment", "count": "Number of Schools"},
    color_discrete_sequence=["#636EFA"],
)

# Add mean and median as vertical lines
mean_enrollment = df["total_enrollment"].mean()
median_enrollment = df["total_enrollment"].median()

fig.add_vline(x=mean_enrollment, 
              line_dash="dash", 
              line_color="purple", 
              annotation_text="Mean", 
              annotation_position="top right",
              annotation_font_color="purple"
             )
fig.add_vline(x=median_enrollment, 
              line_dash="dot", 
              line_color="red", 
              annotation_text="Median", 
              annotation_position="top left", 
              annotation_font_color="red"
             )

fig.update_layout(
    xaxis_title="Total Enrollment",
    yaxis_title="Number of Schools",
    title_font_size=18,
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
    margin=dict(l=40, r=40, t=60, b=40)  # Add margins
)

fig.show()

The total enrollment histogram shows a right-skewed distribution, with most schools having enrollments under 1,000 students. A significant proportion of schools have enrollments between 300 and 349 students, forming a clear cluster in this range.

There are several outliers in the data. The main curve of the distribution tapers off around 1,500 students, but a few schools exceed 2,000 or even 3,000 students. Notably, one school stands out with an enrollment of over 5,000 students.

These outliers in the tail of the distribution have a substantial impact on the mean, as it is sensitive to extreme values. In a right-skewed distribution like this, outliers pull the mean higher, making it less representative of the central tendency of the majority of schools. This highlights the importance of considering both the mean and median when analyzing data with significant skewness.
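
One way to make "outlier" concrete is the 1.5 × IQR rule. This is a common convention rather than the method used in this notebook, and the series below is hypothetical. With the quartiles reported by `df.describe()` (313.25 and 694.95), that rule would put the upper fence at about 1,267 students.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return series[(series < lower_fence) | (series > upper_fence)]


# Hypothetical enrollments: six typical schools and one very large one.
enrollment = pd.Series([300, 320, 350, 400, 450, 500, 5000])
flagged = iqr_outliers(enrollment)

# Compare the mean with and without the flagged values to see their pull.
mean_all = enrollment.mean()
mean_trimmed = enrollment.drop(flagged.index).mean()
print(flagged.tolist(), mean_all, mean_trimmed, enrollment.median())
```

Comparing the mean before and after dropping the flagged rows shows how much the tail shifts it, while the median barely moves.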

Next, I was curious about how schools are distributed by the proportion of students with disabilities, the percentage of English learners, and the percentage of students living in poverty, to identify any areas that need more resources or uncover any patterns.

In [6]:
# Visualization showing the % of students with disabilities in schools

Add insights of what your visualization reveals.

In [7]:
# Visualization showing the % of students who are learning English as a second language in schools

Add insights of what your visualization reveals.

In [8]:
# Visualization showing the % of students whose families are below the poverty line in schools

Add insights of what your visualization reveals.

Correlation Analysis¶

In [9]:
# Scatter plot of percent_english_learners vs percent_poverty

Add insights from findings of scatter plot.

In [10]:
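# Restrict to numeric columns; pandas 2.0+ raises an error on the school_name strings otherwise
df.corr(numeric_only=True)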
Out[10]:
                                    total_enrollment  percent_students_with_disabilities  percent_english_learners  percent_poverty
total_enrollment                            1.000000                           -0.176817                  0.027771        -0.143091
percent_students_with_disabilities         -0.176817                            1.000000                  0.020820         0.079578
percent_english_learners                    0.027771                            0.020820                  1.000000         0.315477
percent_poverty                            -0.143091                            0.079578                  0.315477         1.000000

Add insights from findings from df.corr()
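
To read a correlation matrix more easily, the off-diagonal pairs can be ranked by absolute strength. The sketch below shows one way to do that; `ranked_correlations` is a hypothetical helper and the toy columns are invented. In the matrix above, the strongest pair is percent_english_learners and percent_poverty, at about 0.32.

```python
import pandas as pd


def ranked_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """List each pair of numeric columns once, strongest |r| first."""
    corr = frame.select_dtypes(include="number").corr()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:  # upper triangle only, no self-pairs
            pairs.append((first, second, corr.loc[first, second]))
    result = pd.DataFrame(pairs, columns=["var_a", "var_b", "r"])
    order = result["r"].abs().sort_values(ascending=False).index
    return result.loc[order].reset_index(drop=True)


# Hypothetical data: x and y move together exactly, z moves against both.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],  # non-numeric column is ignored
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

ranked = ranked_correlations(toy)
print(ranked)
```

Correlation only measures linear association, so a weak coefficient does not rule out a nonlinear relationship, and a strong one does not establish cause.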

Insights and Recommendations¶

Add final thoughts and recommendations.
